Ph.D. Research Project: Developing Non-Intrusive Hybrid Fault Tolerance Techniques to Detect Errors in Multiple Processors Systems
Abstract
Fault-tolerance structures are needed in multiple-processor systems because it is almost impossible to manufacture defect-free integrated circuits in nanometer technologies [1]. Fault-tolerant methods are therefore crucial: they allow circuits with a limited number of defects to still reach the market, which increases both yield and chip lifetime. A classical example is DRAM, where defects are compensated for by spare rows and columns. With the technologies introduced in the past decade, the number of expected defects in high-density circuits is increasing, and fault-tolerant techniques able to detect and correct multiple faults are very expensive in terms of area, power, and performance. To reduce this overhead, fault-tolerance features should be turned on only at the exact locations of defects, so that the fault-tolerance structures do not penalize power or performance where no fault is present. Defects can have permanent effects, such as stuck-at, short-circuited, or open signals, or intermittent effects, such as crosstalk between interconnection lines. For these classes of defects, detection and diagnosis can be performed during manufacturing test, and off-line tests can also run during the lifetime of the circuit [2, 3, 4]. Once the fault locations are known, a mechanism can deactivate the defective component and turn on a fault-tolerance feature. For example, in a multiple-processor system, once a defect or failure has been detected in a microprocessor, the software application can be mapped onto the remaining hardware components [5]. Unfortunately, this simple deactivation approach cannot deal with a faulty router or a faulty link in the network-on-chip (NoC) unless the NoC is modified so that it can adapt itself in the presence of faults. We therefore propose methods to detect faults in multiple-processor systems.
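The deactivation approach mentioned in the abstract (mapping the application onto the remaining processors once a faulty one is known) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the project: the function name `remap_tasks`, the representation of the mapping as a task-to-core dictionary, and the least-loaded placement policy. As the abstract notes, a remapping of this kind does not address faulty routers or links in the NoC.

```python
from typing import Dict, Set


def remap_tasks(mapping: Dict[str, int], faulty: Set[int], num_cores: int) -> Dict[str, int]:
    """Move tasks off faulty cores onto the least-loaded healthy cores.

    Hypothetical helper: `mapping` assigns each task name to a core index,
    `faulty` holds the indices of cores reported as defective.
    """
    healthy = [c for c in range(num_cores) if c not in faulty]
    if not healthy:
        raise RuntimeError("no healthy core left to run the application")

    # Count the tasks that already run on healthy cores.
    load = {c: 0 for c in healthy}
    for core in mapping.values():
        if core in load:
            load[core] += 1

    new_mapping: Dict[str, int] = {}
    for task, core in mapping.items():
        if core in load:
            # Task sits on a healthy core: leave it where it is.
            new_mapping[task] = core
        else:
            # Task sits on a deactivated core: place it on the least-loaded
            # healthy core (ties go to the lowest core index).
            target = min(healthy, key=lambda c: load[c])
            load[target] += 1
            new_mapping[task] = target
    return new_mapping


if __name__ == "__main__":
    before = {"t0": 0, "t1": 1, "t2": 2, "t3": 3}
    print(remap_tasks(before, faulty={1}, num_cores=4))
```

A real system would also have to migrate task state and respect communication costs between cores; the sketch only shows the bookkeeping step.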
Similar resources
An approach to fault detection and correction in design of systems using of Turbo codes
We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes in order to obtain flexibility in the design of error-correcting codes. Code-combining techniques are very effective, and turbo codes are one example of such codes. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...
Reaching Fault Diagnosis Agreement under a Hybrid Fault Model
The goal of the fault diagnosis agreement (FDA) problem is to make each fault-free processor detect/locate a common set of faulty processors. The problem is examined on processors with mixed fault model (also referred to as hybrid fault model). An evidence-based fault diagnosis protocol is proposed to solve the FDA problem. The proposed protocol first collects the messages which have accumulat...
Software-Based Fault Recovery via Adaptive Diversity for COTS Multi-Core Processors
The ever growing demands of embedded systems to satisfy high computing performance and cost efficiency lead to the trend of using commercial off-the-shelf hardware. However, due to their highly integrated design they are becoming increasingly susceptible to hardware errors (e.g. caused by radiation-induced soft-errors or wear-out effects). Since such faults cannot be fully prevented, systems ha...
Computing and Reducing Transient Error Propagation in Registers
Recent research indicates that transient errors will increasingly become a critical concern in microprocessor design. As embedded processors are widely used in reliability-critical or noisy environments, it is necessary to develop cost-effective fault-tolerant techniques to protect processors against transient errors. The register file is one of the critical components that can significantly af...
An Approach to Protecting Data Processing Operations in Computing Systems Using Convolutional Codes
We present a framework for algorithm-based fault tolerance (ABFT) methods in the design of fault-tolerant computing systems. The ABFT error detection technique relies on the comparison of parity values computed in two ways. The parallel processing of input parity values produces output parity values comparable with parity values regenerated from the original processed outputs. Number data proc...
Publication date: 2012